A machine-readable specification for genomics assays

Abstract
Motivation: Understanding the structure of sequenced fragments from genomics libraries is essential for accurate read preprocessing. Currently, different assays and sequencing technologies require custom scripts and programs that do not leverage the common structure of sequence elements present in genomics libraries.
Results: We present seqspec, a machine-readable specification for libraries produced by genomics assays that facilitates standardization of preprocessing and enables tracking and comparison of genomics assays.
Availability and implementation: The specification and associated seqspec command line tool are available at https://www.doi.org/10.5281/zenodo.10213865.


Introduction
The proliferation of genomics assays (Ogbeide et al. 2022) has resulted in a corresponding increase in software for processing the data (Zappia et al. 2018). Frequently, custom scripts must be created and tailored to the specifics of assays, where developers reimplement solutions for common preprocessing tasks such as adapter trimming, barcode identification, error correction, and read alignment (Cheow et al. 2016, Ma et al. 2020, Healey et al. 2022, Wu et al. 2022). When software tools are assay specific, parameter choices in these methods can diverge, making it difficult to perform apples-to-apples comparisons of data produced by different assays. Furthermore, the lack of preprocessing standardization makes reanalysis of published data in the context of new data challenging.
While genomics protocols can vary greatly from each other, the libraries they generate share many common elements. Typically, sequenced fragments will contain one or several 'technical sequences' such as barcodes and unique molecular identifiers (UMIs), as well as biological sequences that may be aligned to a genome or transcriptome. Standard library preparation kits generally require that DNA from the libraries is cut, repaired, and ligated to sequencing adapters (Fig. 1). Primers bind to the sequencing adapters and initiate DNA sequencing, whereby reads are subsequently generated. Illumina sequencing employs a sequencing-by-synthesis approach where fluorescently labeled nucleotides are incorporated into single-stranded DNA and imaged, while PacBio uses zero-mode waveguides for single-molecule detection of dNTP incorporation. Oxford Nanopore, on the other hand, binds sequencing adapters to pores in a flow cell, and DNA is sequenced by changes in electrical resistance across the pore (Iizuka et al. 2022).
Many single-cell genomics assays introduce additional library complexity, further complicating preprocessing. For example, the inDropsv3 (Klein et al. 2015) assay produces variable-length barcodes, while the 10× Genomics scRNA-seq assay (Zheng et al. 2017) produces fixed-length barcodes that are derived from a known list of possibilities.
Current file formats such as FASTQ, Genbank, FASTA, and workflow-specific files (Parekh et al. 2018) lack the flexibility to annotate sequenced libraries that contain these complex features. In the absence of sequence annotations, processing can be challenging, limiting the reuse of data that is stored in publicly accessible databases such as the Sequence Read Archive (Katz et al. 2022). To facilitate utilization of genomics data, a database of assays along with a description of their associated library structures was assembled in Chen (2020). While this database has proved to be very useful, the HTML descriptors are not machine readable. Moreover, the lack of a formal specification limits the utility and expandability of the database.

Results
The seqspec specification defines a machine-readable file format, based on YAML, that enables sequence library annotation. Assay- and sequencer-specific molecules are annotated by Regions, which can be nested and appended to create a seqspec in a manner that assumes perfect end-to-end sequencing of a perfectly constructed library. Regions are annotated with a variety of properties that simplify the downstream identification of library elements. The following is a list of properties that can be associated with a Region:
• Sequence type: The type of sequence (fixed, onlist, random, joined).
• Minimum length: The minimum length of the sequence for the Region.
• Maximum length: The maximum length of the sequence for the Region.
• Onlist: The list of permissible sequences from which the Sequence is derived.
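To make the Region properties concrete, the following Python sketch models a Region as a plain data class and runs a few consistency checks on it. It is an illustration only: the field names (`region_id`, `min_len`, `onlist`, and so on), the `problems` helper, and the specific checks are assumptions made for this example and are not taken from the seqspec schema or tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sequence types named in the text.
SEQUENCE_TYPES = {"fixed", "onlist", "random", "joined"}


@dataclass
class Region:
    """Illustrative stand-in for a seqspec Region (field names are assumptions)."""
    region_id: str
    region_type: str
    name: str
    sequence_type: str
    sequence: str
    min_len: int
    max_len: int
    onlist: Optional[str] = None  # hypothetical path to a list of permissible sequences
    regions: List["Region"] = field(default_factory=list)  # non-empty for meta Regions

    def problems(self) -> List[str]:
        """Return human-readable consistency problems (empty list if none)."""
        found = []
        if self.sequence_type not in SEQUENCE_TYPES:
            found.append(f"{self.region_id}: unknown sequence type {self.sequence_type!r}")
        if self.min_len > self.max_len:
            found.append(f"{self.region_id}: min length {self.min_len} exceeds max length {self.max_len}")
        # Assumed rule for this sketch: an 'onlist' Region should name its onlist.
        if self.sequence_type == "onlist" and self.onlist is None:
            found.append(f"{self.region_id}: sequence type 'onlist' but no onlist given")
        return found


# Hypothetical Regions: a 16 bp barcode drawn from an onlist, and a UMI with swapped bounds.
barcode = Region("barcode", "barcode", "Cell barcode", "onlist", "N" * 16, 16, 16, onlist="barcodes.txt")
bad_umi = Region("umi", "umi", "UMI", "random", "N" * 12, 12, 10)

print(barcode.problems())
print(bad_umi.problems())
```

Checks of this kind are the sort of thing a schema validator automates; the authoritative rules are those in the seqspec schema itself.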
Importantly, Regions can contain other Regions, in which case they are known as meta Regions; this property is useful for grouping and identifying library elements that are sequenced together. The YAML format is a natural language to represent nested meta Regions in a human-readable fashion. Python-style indentation and syntax can be used to create a human-readable file format without the excessive grouping delimiters of alternative languages such as JSON. In addition, nested Regions allow Assays to be represented as an Ordered Tree where the ordering of subtrees is significant: atomic Regions are 'glued' together in an order that is concordant with the design of the sequencing library in the 5′ to 3′ direction (Supplementary Fig. S1).
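The ordered-tree view can be sketched in a few lines of Python. In this sketch, a left-to-right traversal of the tree recovers the atomic Regions in 5′ to 3′ order, and the length bounds of a meta Region are taken to be the sums over the atomic Regions it contains. The `Node` class, the Region names, and the lengths are hypothetical, and summing bounds is an assumption about how such metadata could be derived rather than a description of the seqspec tool.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical tree node: a Region with optional child Regions."""
    region_id: str
    min_len: int = 0
    max_len: int = 0
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> Iterator[Node]:
    """Yield atomic Regions in 5' to 3' order (children visited left to right)."""
    if not node.children:
        yield node
        return
    for child in node.children:
        yield from leaves(child)


def length_bounds(node: Node) -> Tuple[int, int]:
    """Sum the min and max lengths over the atomic Regions under a node."""
    atoms = list(leaves(node))
    return sum(a.min_len for a in atoms), sum(a.max_len for a in atoms)


# Hypothetical library: two primers flanking a nested meta Region and a cDNA insert.
library = Node("library", children=[
    Node("primer_5p", 29, 29),
    Node("read_1", children=[Node("barcode", 16, 16), Node("umi", 12, 12)]),
    Node("cdna", 1, 98),
    Node("primer_3p", 34, 34),
])

order = [n.region_id for n in leaves(library)]
bounds = length_bounds(library)
print(order)
print(bounds)
```

Because the order of children is significant, swapping two subtrees describes a different library, which is what distinguishes an ordered tree from a plain nested grouping.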
A key feature of seqspec files is that they are machine-readable, and Region data can be parsed, processed, and extracted with the seqspec command-line tool. The tool contains eleven subcommands that enable various tasks such as specification checking, finding, formatting, and indexing:
1) seqspec check: check the correctness of attributes against the seqspec schema.
2) seqspec find: print Region metadata.
3) seqspec format: auto populate Region metadata for meta Regions.
4) seqspec index: extract the 0-indexed position of Regions.
5) seqspec info: get info about the seqspec file.
6) seqspec init: initialize a seqspec with a newick-formatted string.
7) seqspec modify: modify Region attributes.
8) seqspec onlist: get the path to the onlist file for the specific region type.
9) seqspec print: print html, markdown, ascii, read diagram that visualizes the seqspec.
10) seqspec split: split seqspec into modalities.
11) seqspec version: get seqspec version and seqspec file version.
To illustrate how seqspec can be used to facilitate processing and analysis of single-cell RNA-seq reads, we implemented in the seqspec index command the facility to produce the relevant technology string for three single-cell RNA-seq preprocessing tools: kallisto bustools (Melsted et al. 2021), simpleaf/alevin-fry (He et al. 2022), and STARsolo (Kaminow et al. 2021) (Fig. 2). Regions associated with barcodes, UMIs, and cDNA are extracted, positionally indexed, and formatted on a per-tool basis. The modularity of seqspec makes it simple to produce tool-compatible technology strings for other assay types.
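The idea of positionally indexing Regions and formatting them for a tool can be sketched as follows. The read layout, the Region lengths, and both function names are hypothetical. The output format, colon-separated `file,start,stop` triplets for barcode, UMI, and cDNA, is written in the spirit of the custom technology strings used with kallisto bustools; the exact strings expected by each tool should be obtained from seqspec index rather than from this sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (region type, length) in 5' to 3' order for each read file.
reads: List[List[Tuple[str, int]]] = [
    [("barcode", 16), ("umi", 12)],  # file 0
    [("cdna", 90)],                  # file 1
]


def index_regions(layout: List[List[Tuple[str, int]]]) -> Dict[str, Tuple[int, int, int]]:
    """Map each region type to (file index, start, stop), 0-indexed and half-open."""
    positions = {}
    for file_idx, regions in enumerate(layout):
        start = 0
        for region_type, length in regions:
            positions[region_type] = (file_idx, start, start + length)
            start += length
    return positions


def technology_string(positions: Dict[str, Tuple[int, int, int]]) -> str:
    """Join barcode, UMI, and cDNA positions as 'file,start,stop' triplets."""
    order = ("barcode", "umi", "cdna")
    return ":".join(",".join(str(v) for v in positions[key]) for key in order)


positions = index_regions(reads)
tech = technology_string(positions)
print(tech)
```

Keeping the indexing step separate from the formatting step means that a new tool needs only a new formatter over the same positions.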

Discussion
The seqspec specification and associated tool enable the annotation of a sequence library that has been generated by an assay to be processed with a sequencer-specific kit for sequencing. Associating seqspecs with sequencing data can greatly facilitate reprocessing and interpretation. For example, seqspec can help in investigating differences between sequence reads and library structure, aiding in the study of sequencing artifacts. In terms of facilitating preprocessing, seqspec innovates beyond existing methods such as kb-python's technology string or the read geometry string in simpleaf (He et al. 2022) by providing both annotation of library structures and a suite of tools for format validation.
Standardized annotation of sequencing libraries in a human- and machine-readable format serves several purposes, including the enablement of uniform processing, organization of sequencing assays by constitutive components, and transparency for users. The flexibility of seqspec should allow it to be used for all current sequence census assays (Wold and Myers 2008), and specifications should be readily adaptable to different sequencing platforms; our initial release of seqspec contains specifications for 49 assays (see https://igvf.github.io/seqspec/). In the future, we envision that seqspec could be extended to describe sequencer- or protocol-specific steps as well as utilized to annotate engineered sequences such as DNA constructs.
Comparison of seqspecs for different assays immediately reveals similarities and differences that can be visualized with seqspec print. For example, the SPLiT-seq single-cell RNA and the multimodal SHARE-seq single-cell assays are aimed at different modalities and utilize different protocols to produce libraries, but the resultant structures are very similar (Fig. 1) since they both rely on split-pool barcoding (Rosenberg et al. 2018). The seqspec for the sci-CAR-seq assay (Cao et al. 2018), from which split-pool assays such as SHARE-seq are derived, shows that the cell barcoding is encoded in the Illumina indices. It should be possible to develop an ontology of assays by comparing the seqspec specifications of assays and quantifying their similarities and differences.
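One simple way to quantify such similarity, offered here only as a sketch, is to compare the ordered lists of Region types of two libraries. The example below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; this choice of measure is an assumption of the example and is not proposed in the text. The two region lists are invented and do not describe any real assay.

```python
from difflib import SequenceMatcher
from typing import List


def structure_similarity(a: List[str], b: List[str]) -> float:
    """Score in [0, 1] for how much two ordered lists of region types share."""
    # autojunk is disabled so every region type is always considered.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Hypothetical library structures, listed 5' to 3' (not real assay definitions).
assay_a = ["adapter", "barcode", "linker", "barcode", "umi", "cdna", "adapter"]
assay_b = ["adapter", "barcode", "linker", "barcode", "gdna", "adapter"]

score = structure_similarity(assay_a, assay_b)
print(round(score, 3))
```

A pairwise score of this kind could feed a clustering of assays, although a practical ontology would likely need to weigh Region properties such as lengths and onlists as well as order.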
In demonstrating that seqspec can be used to define options for preprocessing tools, we have shown that seqspec is immediately useful for uniform processing of genomics data. The preprocessing applications will hopefully incentivize data generators to define and deposit seqspec files alongside sequencing reads in public archives such as the Sequence Read Archive. While seqspec is not a suitable format for general metadata storage, the precise specification of sequence elements present in reads, including sequencer-specific constructs, should be helpful in identifying batch effects even when metadata is missing or inaccurate.

• Region ID: Unique identifier for the Region in the seqspec.
• Region type: The type of region.
• Name: A descriptive name for the Region.
• Sequence: The specific nucleotide sequence for the Region.

Figure 1. The structure of molecules in genomic libraries. Sequencing libraries are constructed by combining Atomic Regions to form an adapter-insert-adapter construct. The seqspec for the assay annotates the construct with Regions and meta Regions.

Figure 2. Uniform processing enabled with seqspec. The seqspec index command produces a technology string that identifies appropriate sequence elements and can be passed into processing tools.